ci/test_matrix.yml: test with PyPI cuda-toolkit 13.3.0 #2140

Merged
rwgk merged 2 commits into NVIDIA:main from rwgk:ci_test_matrix_13.3.0
May 27, 2026

@rwgk
Contributor

@rwgk rwgk commented May 26, 2026

Description

Update CI coverage to test against CUDA Toolkit 13.3.0 now that cuda-toolkit 13.3.0 is available on PyPI (posted 2026-05-27 at 07:36 AM PDT).

Similar to the changes under PR #1745 for the CUDA 13.2.0 release, this replaces the CUDA 13.2.1 entries in the pull-request and nightly CI matrices with CUDA 13.3.0, while preserving the existing spread of Python versions, platforms, GPUs, and local-vs-wheel CUDA Toolkit coverage.

While validating the matrix update, the CUDA 13.3.0 local-CTK lanes exposed a redistrib metadata change: CTK 13.3.0 renamed the CCCL component key from cuda_cccl to cccl. The mini-CTK fetch helper now resolves that renamed component so local-CTK jobs still install the CCCL headers required by NVRTC tests.

Changes

  • Update ci/test-matrix.yml from CUDA 13.2.1 to CUDA 13.3.0 for PR and nightly coverage.
  • Add component alias resolution in ci/tools/fetch_ctk_redistrib.py for the cuda_cccl -> cccl redistrib key rename introduced with CTK 13.3.0.
  • Add unit coverage for the renamed CCCL component in ci/tools/tests/test_fetch_ctk_redistrib.py.

@rwgk rwgk added this to the cuda.bindings 13.3.0 & 12.9.7 milestone May 26, 2026
@rwgk rwgk self-assigned this May 26, 2026
@rwgk rwgk added the P0 (High priority - Must do!) and CI/CD (CI/CD infrastructure) labels May 26, 2026
@copy-pr-bot
Contributor

copy-pr-bot Bot commented May 26, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@github-actions github-actions Bot added the cuda.bindings (Everything related to the cuda.bindings module), cuda.core (Everything related to the cuda.core module), and cuda.pathfinder (Everything related to the cuda.pathfinder module) labels May 26, 2026
@rwgk rwgk mentioned this pull request May 26, 2026
@rwgk rwgk force-pushed the ci_test_matrix_13.3.0 branch from 0bd613f to 836c875 Compare May 27, 2026 03:25
@rwgk rwgk changed the title from "ci/test_matrix.yml: test with cuda-toolkit 13.3.0 [ON HOLD until the new cuda-toolkit release becomes available]" to "ci/test_matrix.yml: test with PyPI cuda-toolkit 13.3.0" May 27, 2026
@rwgk
Contributor Author

rwgk commented May 27, 2026

/ok to test

@rwgk
Contributor Author

rwgk commented May 27, 2026

PR 2140: First CI analysis of 13.3.0 failures

Workflow run inspected:

Summary

The failures do not affect all CUDA 13.3.0 test lanes. The CUDA 13.3.0
wheel-based test lanes pass. The failures are concentrated in the CUDA
13.3.0 LOCAL_CTK: '1' lanes, i.e. lanes that assemble and use a local mini
CUDA Toolkit via .github/actions/fetch_ctk.

The common failure mode is NVRTC compilation failing because CUDA C++/CCCL
headers are missing from the local mini-CTK:

NVRTC_ERROR_COMPILATION
catastrophic error: cannot open source file "cuda/atomic"

Other failing tests show the same class of missing-header problem:

catastrophic error: cannot open source file "cuda/std/complex"

and on Windows example tests:

cuda_toolkit\include\cooperative_groups/details/info.h(50): catastrophic error:
cannot open source file "nv/target"

Observed CI Pattern

Passing:

  • Build jobs for CUDA 13.3.0 pass across Linux, Linux aarch64, and Windows.
  • CUDA 13.3.0 (wheels) test lanes pass.

Failing:

  • CUDA 13.3.0 (local) test lanes fail across Linux, Linux aarch64, and
    Windows.

Representative failed jobs include:

  • Test linux-64 / Python 3.11, CUDA 13.3.0 (local), GPU l4
  • Test linux-aarch64 / Python 3.11, CUDA 13.3.0 (local), GPU l4
  • Test win-64 / Python 3.10, CUDA 13.3.0 (local), GPU rtxpro6000 (TCC)

Root Cause

The fetch_ctk action still requests the CCCL redistrib component under the old
component name:

cuda_cccl

For CUDA 13.3.0, the redistrib metadata no longer uses that key. It uses:

cccl

Evidence from redistrib metadata:

redistrib_13.2.1.json:
cuda_cccl/linux-x86_64/cuda_cccl-linux-x86_64-13.2.75-archive.tar.xz

redistrib_13.3.0.json:
cccl/linux-x86_64/cccl-linux-x86_64-13.3.3.3.1-archive.tar.xz
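
The difference between the two metadata files can be sketched as a plain key lookup. This is only an illustration: the two dictionaries are hand-written excerpts built from the paths quoted above, the nested layout with a relative_path field is an assumption about the redistrib JSON shape, and archive_path is a hypothetical helper, not a function from ci/tools/fetch_ctk_redistrib.py.

```python
# Hand-written excerpts for illustration. The paths are the ones quoted in the
# evidence above; the nesting and the "relative_path" field name are assumed.
REDISTRIB_13_2_1 = {
    "cuda_cccl": {
        "linux-x86_64": {
            "relative_path": "cuda_cccl/linux-x86_64/cuda_cccl-linux-x86_64-13.2.75-archive.tar.xz"
        }
    }
}
REDISTRIB_13_3_0 = {
    "cccl": {
        "linux-x86_64": {
            "relative_path": "cccl/linux-x86_64/cccl-linux-x86_64-13.3.3.3.1-archive.tar.xz"
        }
    }
}


def archive_path(metadata, component, platform):
    """Return the archive path for component/platform, or None if absent.

    Hypothetical helper: it only demonstrates that a lookup under the old
    key finds nothing in the newer metadata.
    """
    entry = metadata.get(component, {}).get(platform)
    return entry["relative_path"] if entry else None
```

With these excerpts, looking up cuda_cccl succeeds against the 13.2.1 metadata and returns None against the 13.3.0 metadata, while looking up cccl does the opposite.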

The 13.3.0 logs show the helper filtering out the old component name:

Skipping unsupported CTK component 'cuda_cccl' for host-platform 'linux-64'
Skipping unsupported CTK component 'cuda_cccl' for host-platform 'win-64'

After that, the local mini-CTK is assembled without CCCL headers, so tests that
compile CUDA C++ code through NVRTC fail when they include headers such as:

  • cuda/atomic
  • cuda/std/complex
  • nv/target
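
A quick way to confirm this kind of gap before running the tests is to check the assembled include directory for the headers named above. The following is a minimal sketch, not part of the CI tooling: missing_headers and REQUIRED_HEADERS are hypothetical names, and which directory inside the mini-CTK actually holds the CCCL headers is an assumption left to the caller.

```python
from pathlib import Path

# Headers taken from the NVRTC error messages quoted in this analysis.
REQUIRED_HEADERS = ("cuda/atomic", "cuda/std/complex", "nv/target")


def missing_headers(include_dir, headers=REQUIRED_HEADERS):
    """Return the headers from `headers` that are not files under include_dir.

    Hypothetical helper: the caller decides which include directory of the
    mini-CTK to inspect.
    """
    root = Path(include_dir)
    return [header for header in headers if not (root / header).is_file()]
```

An empty list means every listed header was found; a non-empty list names the headers that NVRTC would fail to open from that directory.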

Why Wheel Lanes Pass

The wheel lanes do not rely on the locally assembled mini-CTK in the same way.
They install CUDA runtime/compiler-related Python wheels, so the cccl component
missing from the .github/actions/fetch_ctk output does not affect them.

This is why the failure is broad across the local 13.3.0 lanes but absent from
the wheel 13.3.0 lanes.

Suggested Fix

Update the mini-CTK component resolution to handle the redistrib component
rename from cuda_cccl to cccl.

Reasonable approaches:

  1. Teach ci/tools/fetch_ctk_redistrib.py to alias cuda_cccl to cccl when
    the metadata contains cccl but not cuda_cccl.
  2. Or include both cuda_cccl,cccl in the default component list and rely on
    metadata filtering to keep only the one that exists.

The first option is cleaner because existing workflow inputs can keep using the
logical old name while the helper adapts to the redistrib metadata shape.
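
The first option can be sketched as follows. This is a minimal illustration under stated assumptions, not the code in ci/tools/fetch_ctk_redistrib.py: COMPONENT_ALIASES and resolve_component are hypothetical names, and the metadata is modeled as a plain dictionary keyed by component name.

```python
# Hypothetical alias table: logical (old) component name -> newer metadata keys.
COMPONENT_ALIASES = {"cuda_cccl": ("cccl",)}


def resolve_component(requested, metadata):
    """Map a requested component name to the key present in the metadata.

    The requested name wins when the metadata still has it; otherwise the
    first alias found in the metadata is used. Returns None when neither
    the name nor any alias is present.
    """
    if requested in metadata:
        return requested
    for alias in COMPONENT_ALIASES.get(requested, ()):
        if alias in metadata:
            return alias
    return None
```

Checking the requested name first keeps older metadata such as 13.2.1 working unchanged, and the alias is consulted only when the old key is absent, which matches the condition described in option 1.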

After the fix, rerun a small representative subset first:

  • one Linux CUDA 13.3.0 (local) lane
  • one Windows CUDA 13.3.0 (local) lane

If those pass, rerun the full PR CI.

@rwgk
Contributor Author

rwgk commented May 27, 2026

/ok to test

@rwgk rwgk marked this pull request as ready for review May 27, 2026 17:46
@rwgk rwgk requested review from Andy-Jost, kkraus14 and rparolin May 27, 2026 17:48
@rwgk rwgk merged commit 88363f8 into NVIDIA:main May 27, 2026
99 checks passed
@rwgk rwgk deleted the ci_test_matrix_13.3.0 branch May 27, 2026 23:11
